TCGA RNA-seq Data Analysis Report¶

Introduction¶

In this report, we describe the data pre-processing steps used to compare the Metabopathia approach with Hipathia. First, a Principal Component Analysis (PCA) and quality control analysis were performed on the RNA-seq data. RNA-seq data were obtained from 12 different cancer types from The Cancer Genome Atlas (TCGA) data portal (https://tcga-data.nci.nih.gov/tcga/). For these cancer types, RNA-seq counts were available for both healthy control and cancer samples:

  • Bladder Urothelial Carcinoma (BLCA) [https://doi.org/10.1038/nature12965, https://doi.org/10.1016/j.cell.2017.09.007],
  • Breast invasive carcinoma (BRCA) [https://doi.org/10.1038/nature11412, https://doi.org/10.1016/j.cell.2015.09.033],
  • Colorectal Adenocarcinoma (COAD) [https://doi.org/10.1038/nature11252],
  • Head and Neck squamous cell carcinoma (HNSC) [https://doi.org/10.1038/nature14129],
  • (Kidney) Clear Cell Renal Cell Carcinoma (KIRC) [https://doi.org/10.1038/nature12222],
  • (Kidney) Papillary Renal Cell Carcinomaa (KIRP) [10.1056/NEJMoa1505917],
  • Liver hepatocellular carcinoma (LIHC) [https://doi.org/10.1016/j.cell.2017.05.046],
  • Lung adenocarcinoma (LUAD) [https://doi.org/10.1038/nature13385, https://doi.org/10.1038/ng.3564],
  • Lung squamous cell carcinoma (LUSC) [https://doi.org/10.1038/nature11404, https://doi.org/10.1038/ng.3564],
  • Prostate adenocarcinoma (PRAD) [https://doi.org/10.1016/j.cell.2015.10.025],
  • Thyroid carcinoma (THCA) [https://doi.org/10.1016/j.cell.2014.09.050],
  • Uterine Corpus Endometrioid Carcinoma (UCEC) [https://doi.org/10.1038/nature12113]

Loading Libraries and sourcing needed files¶

In [1]:
if (!requireNamespace("BiocManager", quietly = TRUE))
  install.packages("BiocManager")
if (!suppressPackageStartupMessages(require("SummarizedExperiment", quietly = TRUE)))
  BiocManager::install("SummarizedExperiment") # For use in a conda environment, use renv::install("bioc::SummarizedExperiment") instead.
if (!suppressPackageStartupMessages(require("SEtools", quietly = TRUE)))
  BiocManager::install("SEtools") # For use in a conda environment, use renv::install("bioc::SummarizedExperiment") instead.
if (!suppressPackageStartupMessages(require("edgeR", quietly = TRUE)))
  BiocManager::install("edgeR") # For use in a conda environment, use renv::install("bioc::edgeR") instead.
#if (!require("EDASeq"))
#  BiocManager::install("EDASeq") # For use in a conda environment, use renv::install("bioc::EDASeq") instead.
In [2]:
suppressPackageStartupMessages(library(SummarizedExperiment))
suppressPackageStartupMessages(library(SEtools))
# suppressPackageStartupMessages(library(EDASeq))
suppressPackageStartupMessages(library(edgeR))
suppressPackageStartupMessages(library(ggplot2))
suppressPackageStartupMessages(library(dplyr))
suppressPackageStartupMessages(library(tibble))
suppressPackageStartupMessages(library(data.table))
suppressPackageStartupMessages(library(FactoMineR))
suppressPackageStartupMessages(library(factoextra))
suppressPackageStartupMessages(library(mixOmics))

Source utils file where functions are pre-definded:

In [3]:
source("utils.R")

Data Acquisition and Preparation¶

As mentioned earlier, this study focuses on RNA-seq data from 12 cancer types. The data were previously downloaded from The Cancer Genome Atlas (TCGA) using a series of shell scripts to automate the process and handle the large dataset efficiently.

Data Download Process¶

To facilitate the download, we used a shell script (get_data_in_parallel.sh) that retrieves data for each cancer type in parallel. This was executed using the Slurm workload manager with the following command (retrieved on June 27, 2024):

sbatch --job-name=getData --mem=200000 --error='.err_parallel.job' --output='.out_parallel.job' get_data_in_parallel.sh {BLCA,BRCA,COAD,HNSC,KIRC,KIRP,LIHC,LUAD,LUSC,PRAD,THCA,UCEC}

sbatch command to submit the job to the Slurm scheduler. --job-name Option specifies the name of the job as getData for example. --mem=200000 Allocates 200 GB of memory for the job, depends on available resources in the cluster! --error and --output: To write error and output logs to specific files (.err_parallel.job and .out_parallel.job).

Script Details¶

  • get_data_in_parallel.sh: This script initiates the download process in parallel for the specified cancer types. You can find the script at the following link: get_data_in_parallel.sh.

Note: This command may require adaptation depending on the resources available in your computing environment. Adjustments to this command will not affect the reproducibility of this report.

Shell Scripts Used¶

  • get_data_in_parallel.sh: This script manages the parallel execution of the data retrieval for each cancer type by calling the next script for each specified type.
    View on GitHub

  • get_data_per_cancer.sh: This script is called by the parallel script to handle the data download for a single cancer type.
    View on GitHub

  • gdc_getData.R: An R script that interacts with the TCGA data portal to fetch the RNA-seq data.
    View on GitHub

Data Storage¶

The downloaded data is organized in a folder named processed_data. Within this folder, there are subfolders for each cancer type containing the following:

  • Temporary files generated during the download process.
  • The key output file: counts?<cancerID>.RData, which contains the RNA-seq counts for each cancer type and will be used in subsequent analysis steps.

By organizing the download process in this manner, we ensure that the data acquisition is efficient, reproducible (data retrieved from TCGA portal on June 27, 2024), and easy to manage. This structure allows us to quickly access and process the data for downstream analysis.

Define Cancer Types and Paths¶

In [4]:
#This cancers was mentioned as used cancers to remove batch effect but are not the same
cancer_list <- c("BLCA","BRCA","COAD","HNSC","KIRC","KIRP","LIHC","LUAD","LUSC","PRAD","THCA","UCEC")
cancer_list_names <- c("Bladder Urothelial Carcinoma",
                       "Breast invasive carcinoma",
                       "Colorectal Adenocarcinoma",
                       "Head and Neck squamous cell carcinoma",
                       "Kidney Clear Cell Renal Cell Carcinoma",
                       "Kidney Papillary Renal Cell Carcinomaa",
                       "Liver hepatocellular carcinoma",
                       "Lung adenocarcinoma",
                       "Lung squamous cell carcinoma",
                       "Prostate adenocarcinoma",
                       "Thyroid carcinoma",
                       "Uterine Corpus Endometrioid Carcinoma")
In [5]:
cancer_list_id <- paste("TCGA", cancer_list, sep = "-")
names(cancer_list_names) <- cancer_list_id
my_dir <- "processed_data/"
In [6]:
data_list <- lapply(cancer_list_id, load_data, my_dir)
In [7]:
merged_data <- mergeSEs(ll = data_list, do.scale = F, use.assays = 'unstranded')
In [29]:
# Save a copy of these merged row counts
saveRDS(object = as_tibble(rownames_to_column(as.data.frame(assay(merged_data)))), 
        file = file.path(my_dir,"unstranded_counts_merged_data_tibble_v1.rds"))
In [30]:
meta_data <- colData(merged_data) %>% as_tibble() %>% dplyr::select(c(barcode, project_id, sample_type))
In [31]:
saveRDS(object = meta_data, file = file.path(my_dir,"counts_merged_metadata_v1.rds"))
In [33]:
all_metaType <- colData(merged_data) %>% as_tibble() %>% dplyr::select(c('sample_type_id','sample_type','specimen_type','tissue_type'))
In [34]:
unique(all_metaType)
A tibble: 6 × 4
sample_type_idsample_typespecimen_typetissue_type
<chr><chr><chr><chr>
01Primary Tumor Solid TissueTumor
11Solid Tissue Normal Solid TissueNormal
06Metastatic Unknown Tumor
02Recurrent Tumor Solid TissueTumor
05Additional - New PrimaryUnknown Tumor
01Primary Tumor Unknown Tumor

Load Previously Merged Data¶

This step is included for cases where we already have the saved merged counts and wish to start directly from them.

In [10]:
version_of_data <- "v1"
In [35]:
merged_countdata <- readRDS(file.path(my_dir,paste0("unstranded_counts_merged_data_tibble_",version_of_data,".rds")))#_sm
In [36]:
dim(merged_countdata)
  1. 60660
  2. 6982

This dataset contains a total of 6,982 samples from 6,251 cases (The number of samples is higher than the number of cases because some patients have more than one sample). Next, we will load the metadata, which contains information about each sample—barcode, project ID, and the type of tissue sample

In [40]:
merged_metaData <- readRDS(paste0("processed_data/counts_merged_metadata_",version_of_data,".rds"))
In [41]:
colnames(merged_metaData)
  1. 'barcode'
  2. 'project_id'
  3. 'sample_type'

Here, we can check the number of samples per project ID.

In [42]:
table(merged_metaData$project_id)
TCGA-BLCA TCGA-BRCA TCGA-COAD TCGA-HNSC TCGA-KIRC TCGA-KIRP TCGA-LIHC TCGA-LUAD 
      431      1231       524       566       614       323       424       600 
TCGA-LUSC TCGA-PRAD TCGA-THCA TCGA-UCEC 
      553       554       572       589 

Here, we can check the number of participants/cases per project ID.

In [44]:
merged_metaData %>%rowwise() %>%mutate(part =unlist(strsplit(barcode, '-'))[3]) %>% .[!(duplicated(.$part)),] %>% dplyr::select(project_id) %>%table()
project_id
TCGA-BLCA TCGA-BRCA TCGA-COAD TCGA-HNSC TCGA-KIRC TCGA-KIRP TCGA-LIHC TCGA-LUAD 
      406      1095       458       521       533       290       371       517 
TCGA-LUSC TCGA-PRAD TCGA-THCA TCGA-UCEC 
      501       497       505       557 

Quality Control and Normalization¶

Filter Lowly Expressed Genes (skipped):¶

This step is skipped because the mechanistic modeling method requires all data, unlike differential expression analysis pipelines. If low expressed genes were removed, the method would impute them with 0.5 (or another value), which would not reflect reality, as these genes are low expressed in reality.

Exploratory analysis¶

Defining batchs and colors¶

Before starting our analysis, let's define a set of colors for each batch to generate several plots that will show the our data clearly. The chosen palette will consider colorblindness as much as possible to ensure the clarity of the results for a wide audience.

First, let's assign colors for each sample type:

In [45]:
unique(merged_metaData$sample_type)
  1. 'Primary Tumor'
  2. 'Solid Tissue Normal'
  3. 'Metastatic'
  4. 'Recurrent Tumor'
  5. 'Additional - New Primary'
In [46]:
# Colors for sample types:
colors <- data.frame("id" = merged_metaData$barcode,
                     "cancer"= factor(merged_metaData$project_id),
                     "type" = factor(merged_metaData$sample_type)) %>%
    mutate(by_type= case_when(type == "Solid Tissue Normal" ~ "#009E73", 
                        type == "Primary Solid Tumor" | type == "Primary Tumor" ~ "#E69F00",
                        type == "Recurrent Solid Tumor" | type == "Recurrent Tumor" ~ "#D55E00",
                        type == "Additional - New Primary" ~ "#0072B2",
                        type == "Metastatic" ~ "#CC79A7"))
rownames(colors)<- colors$id

Then colors by cancer type or project id:

In [47]:
# other colors 
# by cancer 
cols_cancer<-setNames(rainbow(length(levels(colors$cancer))),levels(colors$cancer))
colors <- colors %>%  mutate(by_cancer = cols_cancer[cancer])
In [48]:
#Define batches from barcode
batchs <- do.call(rbind,strsplit(colors$id,"-")) %>% as.data.frame(stringsAsFactors = T)
colnames(batchs) <- c('project', 'tss', 'participant', 'sampleVial', 'portionAnalyte', 'plate', 'center')
In [98]:
rownames(batchs)<- colors$id
batchs$cancer <- colors$cancer
batchs$type <- colors$type
batchs$group <- ifelse(batchs$type =="Solid Tissue Normal", "Normal", "Tumor") %>% as.factor()
In [99]:
head(batchs)
A data.frame: 6 × 10
projecttssparticipantsampleVialportionAnalyteplatecentercancertypegroup
<fct><fct><fct><fct><fct><fct><fct><fct><fct><fct>
TCGA-CU-A3KJ-01A-11R-A21D-07TCGACUA3KJ01A11RA21D07TCGA-BLCAPrimary TumorTumor
TCGA-K4-A3WU-01B-11R-A23N-07TCGAK4A3WU01B11RA23N07TCGA-BLCAPrimary TumorTumor
TCGA-DK-A3IU-01A-11R-A20F-07TCGADKA3IU01A11RA20F07TCGA-BLCAPrimary TumorTumor
TCGA-GV-A40G-01A-11R-A23N-07TCGAGVA40G01A11RA23N07TCGA-BLCAPrimary TumorTumor
TCGA-DK-A3IN-01A-11R-A20F-07TCGADKA3IN01A11RA20F07TCGA-BLCAPrimary TumorTumor
TCGA-SY-A9G0-01A-12R-A38B-07TCGASYA9G001A12RA38B07TCGA-BLCAPrimary TumorTumor
In [50]:
levels(batchs$center)
'07'

According to the GDC TCGA Code Tables, the code 07 corresponds to the following center:

  • Center Name: University of North Carolina
  • Center Type: CGCC
  • Display Name: UNC
In [54]:
##by center -> HAs to be skipped because theres only one center !
cols_center<-setNames(rainbow(length(levels(batchs$center))),levels(batchs$center))
colors <- colors %>%  mutate(by_center = cols_center[batchs$center])
In [55]:
unique(colors$by_center)
'#FF0000'
In [56]:
levels(batchs$plate) %>% length
178
In [57]:
# by plate
cols_plate<-setNames(rainbow(length(levels(batchs$plate))),levels(batchs$plate))
colors <- colors %>% mutate(by_plate = cols_plate[batchs$plate])
In [58]:
levels(batchs$tss) %>% length
373
In [59]:
# by tss
cols_tss<-setNames(rainbow(length(levels(batchs$tss))),levels(batchs$tss))
colors <- colors %>%  mutate(by_tss = cols_tss[batchs$tss])
In [60]:
# by TumorVsNormal
colors <- colors %>% mutate(by_tumorVsNormal = case_when(type == "Solid Tissue Normal" ~ "#0000FF" ,
                                            .default = "#FF0000"))
In [61]:
head(colors,n = 14) %>% tail
A data.frame: 6 × 9
idcancertypeby_typeby_cancerby_centerby_plateby_tssby_tumorVsNormal
<chr><fct><fct><chr><chr><chr><chr><chr><chr>
TCGA-GD-A2C5-01A-12R-A180-07TCGA-GD-A2C5-01A-12R-A180-07TCGA-BLCAPrimary Tumor #E69F00#FF0000#FF0000#0053FF#00A7FF#FF0000
TCGA-G2-A2EK-01A-22R-A18C-07TCGA-G2-A2EK-01A-22R-A18C-07TCGA-BLCAPrimary Tumor #E69F00#FF0000#FF0000#004AFF#00C4FF#FF0000
TCGA-BL-A13J-11A-13R-A10U-07TCGA-BL-A13J-11A-13R-A10U-07TCGA-BLCASolid Tissue Normal#009E73#FF0000#FF0000#00FFEE#00FF0B#0000FF
TCGA-ZF-AA4W-01A-12R-A38B-07TCGA-ZF-AA4W-01A-12R-A38B-07TCGA-BLCAPrimary Tumor #E69F00#FF0000#FF0000#FF0067#FF0010#FF0000
TCGA-C4-A0F6-01A-11R-A10U-07TCGA-C4-A0F6-01A-11R-A10U-07TCGA-BLCAPrimary Tumor #E69F00#FF0000#FF0000#00FFEE#00FF24#FF0000
TCGA-YF-AA3L-01A-11R-A38B-07TCGA-YF-AA3L-01A-11R-A38B-07TCGA-BLCAPrimary Tumor #E69F00#FF0000#FF0000#FF0067#FF0021#FF0000
In [62]:
by_type_n <-colors %>% group_by(cancer, type, by_type) %>% summarise(count = n()) %>% mutate(cancer_full_name = cancer_list_names[cancer])
head(by_type_n)
`summarise()` has grouped output by 'cancer', 'type'. You can override using the `.groups` argument.
A grouped_df: 6 × 5
cancertypeby_typecountcancer_full_name
<fct><fct><chr><int><chr>
TCGA-BLCAPrimary Tumor #E69F00 412Bladder Urothelial Carcinoma
TCGA-BLCASolid Tissue Normal#009E73 19Bladder Urothelial Carcinoma
TCGA-BRCAMetastatic #CC79A7 7Breast invasive carcinoma
TCGA-BRCAPrimary Tumor #E69F001111Breast invasive carcinoma
TCGA-BRCASolid Tissue Normal#009E73 113Breast invasive carcinoma
TCGA-COADMetastatic #CC79A7 1Colorectal Adenocarcinoma
In [63]:
options(repr.plot.width = 20, repr.plot.height = 10)  # Adjust width and height as needed
In [67]:
p<-ggplot(by_type_n, aes(y=cancer_full_name, x=count, fill=type)) +
    geom_bar(stat='identity', position='dodge') + 
    geom_text(aes(label=count), 
              position=position_dodge(width=1), 
              hjust=-0.25, vjust=0.5, size=4) +  # Adjust vjust and size as needed
    scale_fill_manual(values = setNames(by_type_n$by_type, by_type_n$type))+
    theme(axis.text.y = element_text(size=14),  # Larger y-axis labels
          axis.text.x = element_text(hjust=1, vjust=0.5, size=14),  # Larger x-axis labels
          axis.title.x = element_text(size=18, margin = margin(t = 10)),  # Larger x-axis title
          axis.title.y = element_text(size=18, margin = margin(r = 10)),  # Larger y-axis title
          plot.title = element_text(size=20, face="bold", hjust = 0.5),  # Larger, centered plot title
          legend.title = element_text(size=16),  # Larger legend title
          legend.text = element_text(size=14),  # Larger legend text
          plot.margin = margin(10, 10, 10, 10)) +
    ggtitle("Number and Type of Samples by Cancer Type") +
    labs(fill = "Sample Types", y="Cancer Types", x="Number of Samples")
p
No description has been provided for this image
In [71]:
# Save the plot as a PNG file
ggsave("figures/samplesTypes_per_cancer.png", plot = p, width = 10, height = 8, dpi = 300)
In [72]:
by_tumorVsNormal_n <-colors %>% mutate(by_tumorVsNormal_t = case_when(type == "Solid Tissue Normal" ~  "Normal" ,
                                            .default = "Tumor"))%>% group_by(cancer, by_tumorVsNormal_t) %>% summarise(count = n()) %>% mutate(cancer_full_name = cancer_list_names[cancer])
head(by_tumorVsNormal_n)
`summarise()` has grouped output by 'cancer'. You can override using the `.groups` argument.
A grouped_df: 6 × 4
cancerby_tumorVsNormal_tcountcancer_full_name
<fct><chr><int><chr>
TCGA-BLCANormal 19Bladder Urothelial Carcinoma
TCGA-BLCATumor 412Bladder Urothelial Carcinoma
TCGA-BRCANormal 113Breast invasive carcinoma
TCGA-BRCATumor 1118Breast invasive carcinoma
TCGA-COADNormal 41Colorectal Adenocarcinoma
TCGA-COADTumor 483Colorectal Adenocarcinoma
In [73]:
ggplot(by_tumorVsNormal_n, aes(y=cancer_full_name, x=count, fill=by_tumorVsNormal_t)) +
    geom_bar(stat='identity', position='dodge') + 
    geom_text(aes(label=count), 
              position=position_dodge(width=1), 
              hjust=-0.25, vjust=0.5, size=4) +  # Adjust vjust and size as needed 
    scale_fill_manual(values = c("Tumor"='#FF0000',"Normal"='#0000FF')) +
    theme(axis.text.y = element_text(size=14),  # Larger y-axis labels
          axis.text.x = element_text(hjust=1, vjust=0.5, size=14),  # Larger x-axis labels
          axis.title.x = element_text(size=18, margin = margin(t = 10)),  # Larger x-axis title
          axis.title.y = element_text(size=18, margin = margin(r = 10)),  # Larger y-axis title
          plot.title = element_text(size=20, face="bold", hjust = 0.5),  # Larger, centered plot title
          legend.title = element_text(size=16),  # Larger legend title
          legend.text = element_text(size=14),  # Larger legend text
          plot.margin = margin(10, 10, 10, 10)) +
    ggtitle("Number and Type of Samples by Cancer Type: Tumor Vs Normal") +
    labs(fill = "Sample Types", y="Cancer Types", x="Number of Samples")
No description has been provided for this image

Some plots for row data¶

In [76]:
# Raw counts mean expression Vs standard Deviation (SD)
#plot(rowMeans(as.matrix(merged_countdata[,-1])), rowSds(as.matrix(merged_countdata[,-1])), 
#     main='Raw counts: sd vs mean', 
#     xlim=c(0,10000),
#     ylim=c(0,10000))

Principal Component Analysis for raw count data¶

In [77]:
pca <- prcomp(t(merged_countdata[,-1]), scale = F,center = F)
In [78]:
pca_data <- as.data.frame(pca$x)

PCA Plots of Samples to Discover Batch Effects

In [84]:
plot_PCAs(pca_data, colors)
No description has been provided for this image

Principal Component Analysis (PCA) was performed to detect possible batch effects in the RNA-seq counts for 12 cancer types. The PCA plots are colored by various variables to reveal potential sources of variation:

  • Panel A: Samples are colored by cancer type. This plot reveals the distribution of samples across different cancer types and highlights clustering based on cancer type.

  • Panel B: Samples are colored by plate. This panel shows a 'clear' batch effect, with distinct clustering of samples from different plates, indicating potential technical variation related to the processing plates.

  • Panel C: Samples are colored by Tissue Source Site (TSS). This plot provides insight into how TSS affects sample distribution and helps identify any associated batch effects.

  • Panel D: Samples are colored by normal versus tumor status. This panel illustrates the separation between normal and tumor samples and helps assess whether batch effects might obscure or mimic biological differences.

A clear batch effect was identified in the plots for cancer type and plate. To address and correct these batch effects, the COMBAT algorithm will be applied. COMBAT is a statistical technique used to adjust for batch effects in high-dimensional data, ensuring that the observed biological variations are not confounded by technical artifacts. This adjustment corrects for batch effects related to both the plate and cancer types, facilitating a more accurate comparison of overall tumor data rather than specific tumor types.

Remove batch effect using COMBAT:¶

In [86]:
data4combat<-merged_countdata[,-1]
In [90]:
var.data <- apply(data4combat, 1, var)
data4combat_filtered <- data4combat[which(var.data != 0 ),]  # Maria P and Mrin mentioned: We removed entries with zero variance because including them would cause COMBAT to fail.

Important Note: The removed information needs to be reconsidered. Removing genes with zero variance before applying COMBAT and then imputing them for Pathway Activity analysis introduces an additional layer of potential inaccuracy. These genes do not vary, and their imputation could misslead the analysis.

In [85]:
head(batchs)
A data.frame: 6 × 7
projecttssparticipantsampleVialportionAnalyteplatecenter
<fct><fct><fct><fct><fct><fct><fct>
1TCGACUA3KJ01A11RA21D07
2TCGAK4A3WU01B11RA23N07
3TCGADKA3IU01A11RA20F07
4TCGAGVA40G01A11RA23N07
5TCGADKA3IN01A11RA20F07
6TCGASYA9G001A12RA38B07
In [94]:
suppressPackageStartupMessages(library(sva))
my_mod <- model.matrix(~as.factor(plate), data=batchs)
# Apply ComBat
combat <- ComBat(dat = data4combat_filtered, 
                  batch = batchs$plate, 
                  par.prior = TRUE, 
                  prior.plots = FALSE)  # Suppress plotting
Found 33303 genes with uniform expression within a single batch (all zeros); these will not be adjusted for batch.
Using the 'mean only' version of ComBat

Found178batches

Note: one batch has only one sample, setting mean.only=TRUE

Adjusting for0covariate(s) or covariate level(s)

Standardizing Data across genes

Fitting L/S model and finding priors

Finding parametric adjustments

Adjusting the Data


In [95]:
pca_unscaled_uncenterd_combat <- prcomp(t(combat), scale = F, center = F)
pca_data_unscaled_uncenterd_combat <- as.data.frame(pca_unscaled_uncenterd_combat$x)
plot_PCAs(pca_data_unscaled_uncenterd_combat, colors)
No description has been provided for this image

Important Remark: Before and after removing batch effects by plate, there is no clear difference observed between the datasets. This step needs to be revisited and redone to ensure proper batch effect correction. Further investigation is required to confirm the effectiveness of the batch effect removal process.

In [100]:
combat2 <- ComBat(dat=combat, batch=batchs$cancer,
                  par.prior = TRUE, 
                  prior.plots = FALSE)  # Suppress plotting
Found 2959 genes with uniform expression within a single batch (all zeros); these will not be adjusted for batch.
Found12batches

Adjusting for0covariate(s) or covariate level(s)

Standardizing Data across genes

Fitting L/S model and finding priors

Finding parametric adjustments

Adjusting the Data


In [101]:
pca_unscaled_uncenterd_combat2 <- prcomp(t(combat2), scale = F, center = F)
pca_data_unscaled_uncenterd_combat2 <- as.data.frame(pca_unscaled_uncenterd_combat2$x)
plot_PCAs(pca_data_unscaled_uncenterd_combat2, colors)
No description has been provided for this image

Important Remark: The differences between tumor and control classes are not clearly visible after batch effect correction. This raises concerns that we may be losing important biological effects during the batch effect cleaning process. Further review and re-evaluation are needed to ensure that the biological distinctions between tumor and control samples are accurately preserved and not inadvertently masked by the correction process.

Normalize Data using TMM¶

In [118]:
my_DGEList <- DGEList(counts=combat2, group = colors$type)
dge <- calcNormFactors(my_DGEList)
logCPM <- cpm(dge, log=TRUE)
Error in DGEList.default(counts = combat2, group = colors$type): Negative counts not allowed
Traceback:

1. DGEList(counts = combat2, group = colors$type)
2. DGEList.default(counts = combat2, group = colors$type)   # at line 1-2 of file /tmp/Rtmp27SShG/R.INSTALL1266ce47ac7a07/edgeR/R/DGEList.R
3. stop("Negative counts not allowed")   # at line 13 of file /tmp/Rtmp27SShG/R.INSTALL1266ce47ac7a07/edgeR/R/DGEList.R

PCAs after normalization¶

In [ ]:
pca_normalized_unscaled_uncenterd <- prcomp(t(logCPM), scale = FALSE, center = FALSE)
pca_data_normalized_unscaled_uncenterd <- as.data.frame(pca_normalized_unscaled_uncenterd$x)
In [ ]:
pca_normalized_unscaled_centerd <- prcomp(t(logCPM), scale = FALSE, center = TRUE)
pca_data_normalized_unscaled_centerd <- as.data.frame(pca_normalized_unscaled_centerd$x)
In [ ]:
pca_normalized_scaled_uncenterd <- prcomp(t(logCPM), scale = TRUE, center = FALSE)
pca_data_normalized_scaled_uncenterd <- as.data.frame(pca_normalized_scaled_uncenterd$x)
In [ ]:
pca_normalized_scaled_centerd <- prcomp(t(logCPM), scale = TRUE, center = TRUE)
pca_data_normalized_scaled_centerd <- as.data.frame(pca_normalized_scaled_centerd$x)
PCA of row data:¶
In [ ]:
plot_PCAs(pca_data_unscaled_uncenterd_combat2, colors)
PCA of normalized data, without scaling or centering:¶
In [ ]:
plot_5_PCAs(pca_data_normalized_unscaled_uncenterd, colors)
PCA of normalized data, without scaling but with centering:¶
In [ ]:
plot_5_PCAs(pca_data_normalized_unscaled_centerd, colors)
PCA of normalized data, with scaling but without centering:¶
In [ ]:
plot_5_PCAs(pca_data_normalized_scaled_uncenterd, colors)
PCA of normalized data, with scaling and centering:¶
In [ ]:
plot_5_PCAs(pca_data_normalized_scaled_centerd, colors)

Other plots for tests (to be removed)¶

In [105]:
res.pca <- PCA(t(merged_countdata[,-1]), graph = FALSE, ncp = 2, scale.unit = F)
In [106]:
fviz_pca_ind(res.pca, geom = "point",col.ind = colors$cancer, addEllipses = T)
No description has been provided for this image
In [108]:
fviz_pca_ind(res.pca, geom = "point",col.ind = colors$type)
No description has been provided for this image
In [109]:
mypca = mixOmics::pca(t(merged_countdata[,-1]), ncomp = 2, center = FALSE, scale = FALSE)
In [110]:
plot(mypca)
No description has been provided for this image
In [111]:
p1<-plotIndiv(mypca, comp = 1:2, 
          #ind.names = batchs$participant,
          pch= batchs$group,
          group = colors$cancer,
          # graphical parameters
          col = unique(colors$by_cancer), style = "ggplot2", 
          legend = TRUE, legend.position = "right", 
          legend.title = "Cancer type", ellipse = TRUE, 
          ellipse.level = 0.95, centroid = FALSE)
No description has been provided for this image
In [112]:
p2<-plotIndiv(mypca, comp = 1:2, 
          #ind.names = batchs$participant,
          pch= batchs$group,
          group = batchs$center,
          # graphical parameters
          col = unique(colors$by_center), style = "ggplot2", 
          legend = TRUE, legend.position = "right", 
          legend.title = "Center", ellipse = TRUE, 
          ellipse.level = 0.95, centroid = FALSE)
No description has been provided for this image
In [113]:
p3<-plotIndiv(mypca, comp = 1:2, 
          #ind.names = batchs$participant,
          pch= batchs$group,
          group = batchs$plate,
          # graphical parameters
          #col = unique(colors$by_plate), 
              style = "ggplot2", 
          legend = F, legend.position = "right", 
          legend.title.pch = "Plate", ellipse = F, 
          ellipse.level = 0.95, centroid = FALSE)
No description has been provided for this image
In [114]:
p4<-plotIndiv(mypca, comp = 1:2, 
          #ind.names = batchs$participant,
          pch= batchs$group,
          group = batchs$tss,
          # graphical parameters
          col = unique(colors$by_tss), style = "ggplot2", 
          legend = F, legend.position = "right", 
          legend.title = "Center", ellipse = F, 
          ellipse.level = 0.95, centroid = FALSE)
No description has been provided for this image
In [115]:
p5<-plotIndiv(mypca, comp = 1:2, 
          #ind.names = batchs$participant,
          pch= batchs$group,
          group = colors$type,
          # graphical parameters
          col = unique(colors$by_type), style = "ggplot2", 
          legend = TRUE, legend.position = "right", 
          legend.title = "Center", ellipse = TRUE, 
          ellipse.level = 0.95, centroid = FALSE)
No description has been provided for this image